GPOP: A cache- and work-efficient framework for Graph Processing Over Partitions
The past decade has seen the development of many shared-memory graph processing
frameworks intended to reduce the effort of developing high-performance
parallel applications. However, many of these frameworks, based on
vertex-centric or edge-centric paradigms, suffer from several issues, such as
poor cache utilization, irregular memory accesses, heavy use of synchronization
primitives, and theoretical inefficiency, that deteriorate overall performance
and scalability.
Recently, we proposed a cache- and memory-efficient partition-centric paradigm
for computing PageRank. In this paper, we generalize this approach to develop a
novel Graph Processing Over Partitions (GPOP) framework that is
cache-efficient, scalable, and work-efficient. GPOP induces locality in memory
accesses by raising the granularity of execution to vertex subsets called
'partitions', thereby dramatically improving the cache performance of a variety
of graph algorithms. It achieves high scalability by enabling completely lock-
and atomic-free computation. GPOP's built-in analytical performance model
enables it to use a hybrid of source-centric and partition-centric communication
modes in a way that ensures work efficiency in each iteration, while
simultaneously promoting high-bandwidth sequential memory accesses.
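The partition-centric idea can be sketched in a few lines. The following is a hypothetical plain-Python illustration of one PageRank-style iteration; the function name and data layout are assumptions for illustration, not GPOP's actual interface. Because every write for a partition targets only vertices owned by that partition, partitions can be processed in parallel with no locks or atomics.

```python
# Hypothetical sketch of partition-centric processing (illustrative only;
# names and data layout are assumptions, not GPOP's API).

def partition_centric_pagerank_step(parts, edges_by_dst_part, ranks, out_deg, d=0.85):
    """One PageRank iteration. `parts` maps partition id -> vertex list;
    `edges_by_dst_part` maps partition id -> list of (src, dst) edges whose
    destination lies in that partition."""
    n = len(ranks)
    new_ranks = [(1.0 - d) / n] * n
    # All writes for partition p land in vertices owned by p, so the
    # partitions could run fully in parallel without synchronization.
    for p in parts:
        for src, dst in edges_by_dst_part[p]:
            new_ranks[dst] += d * ranks[src] / out_deg[src]
    return new_ranks
```

In GPOP, each partition is additionally sized so its vertex data fits in cache, which is what turns this layout into fewer cache misses.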
We extensively evaluate the performance of GPOP for a variety of graph
algorithms, using several large datasets. We observe that GPOP incurs up to 9x,
6.8x, and 5.5x fewer L2 cache misses than Ligra, GraphMat, and Galois,
respectively. In terms of execution time, GPOP is up to 19x, 9.3x, and 3.6x
faster than Ligra, GraphMat, and Galois, respectively. (23 pages, 7 figures, 4 tables.)
PolarStar: Expanding the Scalability Horizon of Diameter-3 Networks
In this paper, we present PolarStar, a novel family of diameter-3 network
topologies derived from the star product of two low-diameter factor graphs. The
proposed PolarStar construction gives the largest known diameter-3 network
topologies for almost all radixes. Compared to state-of-the-art diameter-3
networks, PolarStar achieves a geometric-mean increase in scale of 31% over
Bundlefly, 91% over Dragonfly, and 690% over 3-D HyperX.
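The "scale" being compared here is bounded above by the Moore bound, the graph-theoretic ceiling on how many radix-d routers a diameter-k network can contain. A small illustrative sketch, not taken from the paper:

```python
# The Moore bound caps the node count of any d-regular network of diameter k:
# N <= 1 + d + d(d-1) + ... + d(d-1)^(k-1). Diameter-3 topologies such as
# PolarStar are judged by how close they get to this ceiling.

def moore_bound(d, k):
    """Maximum possible node count for a d-regular graph of diameter k."""
    total, frontier = 1, d
    for _ in range(k):
        total += frontier
        frontier *= d - 1
    return total
```

For example, `moore_bound(3, 2)` is 10, which the Petersen graph actually attains; for diameter 3 the bound grows with another factor of roughly d-1, which is why diameter-3 families scale so much further than diameter-2 ones at the same radix.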
PolarStar has many other desirable properties, including a modular layout,
large bisection, high resilience to link failures, and a large number of
feasible sizes for every radix. Our evaluation shows that it exhibits
comparable or better performance than other diameter-3 networks under various
traffic patterns. (13 pages, 13 figures, 4 tables.)
SPEC2: SPECtral SParsE CNN Accelerator on FPGAs
To accelerate inference of Convolutional Neural Networks (CNNs), various
techniques have been proposed to reduce computation redundancy. Converting
convolutional layers into the frequency domain significantly reduces the
computational complexity of the sliding-window operations in the space domain.
On the other hand, weight-pruning techniques address the redundancy in model
parameters by converting dense convolutional kernels into sparse ones. To
obtain a high-throughput FPGA implementation, we propose SPEC2 -- the first work
to prune and accelerate spectral CNNs. First, we propose a systematic pruning
algorithm based on the Alternating Direction Method of Multipliers (ADMM). The
offline pruning iteratively sets the majority of spectral weights to zero,
without using any handcrafted heuristics. Then, we design an optimized pipeline
architecture on FPGA that has efficient random access into the sparse kernels
and exploits various dimensions of parallelism in convolutional layers.
Overall, SPEC2 achieves high inference throughput with extremely low
computational complexity and negligible accuracy degradation. We demonstrate
SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex
platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy
loss for LeNet and <1% accuracy loss for VGG16. The resulting accelerators
achieve up to 24x higher throughput compared with state-of-the-art FPGA
implementations for VGG16. (10-page conference paper at the 26th IEEE
International Conference on High Performance Computing, Data, and Analytics, HiPC.)
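The frequency-domain savings that motivate SPEC2 come from the convolution theorem. As a minimal 1-D illustration (an assumption for clarity; SPEC2 itself operates on 2-D spectral kernels on FPGA), pointwise multiplication of spectra replaces the quadratic sliding-window work:

```python
import numpy as np

def circular_conv_direct(x, h):
    """O(n^2) sliding-window circular convolution in the space domain."""
    n = len(x)
    return np.array([sum(x[(i - j) % n] * h[j] for j in range(n))
                     for i in range(n)])

def circular_conv_fft(x, h):
    """Same result via the convolution theorem: FFT, pointwise multiply,
    inverse FFT -- O(n log n) instead of O(n^2)."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))
```

Pruning spectral weights then makes the pointwise-multiply stage sparse, which is the redundancy SPEC2's FPGA pipeline exploits.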
Quickly Finding a Truss in a Haystack
The k-truss of a graph is a subgraph in which each edge is tightly connected to the remaining elements of the k-truss, and it can represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and recompute static-graph triangle counts in each iteration. In this work we present a novel algorithm for finding both the k-truss of the graph (for a given k) and the maximal k-truss, using a dynamic graph formulation. Our algorithm has two main benefits. 1) Unlike many algorithms that rerun static-graph triangle counting after the removal of nonconforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by each removal. As our updates are local, we do only a fraction of the work of other algorithms. 2) Our algorithm is extremely scalable and is able to detect deleted triangles concurrently, in contrast to past sequential approaches. While our algorithm is architecture independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100x to 10000x faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70x, over a recently developed sequential and highly optimized algorithm.
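The local-update idea can be illustrated with a simple sequential peeling sketch; this is a plain-Python assumption for clarity, not the paper's concurrent CUDA implementation. Each edge tracks its triangle support; removing an edge decrements only the edges that shared a triangle with it:

```python
from collections import defaultdict

def k_truss(edges, k):
    """Peel edges whose triangle support falls below k-2, updating only the
    edges affected by each removal (the local-update idea; a sequential
    sketch, not the concurrent GPU algorithm)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    support = {}
    for u, v in edges:                      # initial triangle count per edge
        support[(min(u, v), max(u, v))] = len(adj[u] & adj[v])
    queue = [e for e, s in support.items() if s < k - 2]
    while queue:
        u, v = queue.pop()
        if (u, v) not in support:           # already peeled
            continue
        del support[(u, v)]
        adj[u].discard(v); adj[v].discard(u)
        # Only triangles through the removed edge lose support: local update.
        for w in adj[u] & adj[v]:
            for e in ((min(u, w), max(u, w)), (min(v, w), max(v, w))):
                if e in support:
                    support[e] -= 1
                    if support[e] < k - 2:
                        queue.append(e)
    return set(support)
```

For example, on a triangle with one pendant edge, the 3-truss keeps the triangle and peels the pendant edge without recounting the surviving triangles.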
A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network
Novel low-diameter network topologies such as Slim Fly (SF) offer significant
cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To
spearhead the adoption of low-diameter networks, we design, implement, deploy,
and evaluate the first real-world SF installation. We focus on deployment,
management, and operational aspects of our test cluster with 200 servers and
carefully analyze performance. We demonstrate techniques for simple cabling and
cabling validation as well as a novel high-performance routing architecture for
InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's
strong performance for many modern workloads such as deep neural network
training, graph analytics, or linear algebra kernels. SF outperforms
non-blocking Fat Trees in scalability while offering comparable or better
performance and lower cost for large network sizes. Our work can facilitate
deploying SF, while the associated (open-source) routing architecture is fully
portable and applicable to accelerating any low-diameter interconnect.
In-network Allreduce with Multiple Spanning Trees on PolarFly
Allreduce is a fundamental collective used in parallel computing and in distributed training of machine learning models, and it can become a performance bottleneck on large systems. In-network computing improves Allreduce performance by reducing packets on the fly using network routers. However, the throughput of current in-network solutions is limited to a single link bandwidth. We develop, compare, and contrast two different sets of Allreduce spanning trees embedded into PolarFly, a high-performance diameter-2 network topology. Both of our solutions offer theoretically guaranteed near-optimal performance, boosting Allreduce bandwidth by a factor equal to half the network radix of the nodes. While our first set offers low latency with trees of depth 3, the second set offers a congestion-free implementation that reduces the complexity and resource requirements of in-network computing units. In doing so, we also distinguish PolarFly as a highly suitable network for distributed deep learning and other applications that employ throughput-bound large Allreduce operations.
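The bandwidth argument can be shown with a toy model, a hypothetical sketch rather than the paper's embedding: splitting the payload across multiple spanning trees that use disjoint links multiplies the effective Allreduce bandwidth by the number of trees, while the reduced result is unchanged.

```python
import numpy as np

def multitree_allreduce(node_vectors, num_trees):
    """Toy model (not PolarFly's tree embedding): each of `num_trees`
    spanning trees carries one chunk of the payload, so per-link traffic
    drops by roughly a factor of num_trees."""
    n = len(node_vectors[0])
    chunks = np.array_split(np.arange(n), num_trees)
    result = np.empty(n)
    for idx in chunks:
        # Tree t reduces its chunk across all nodes, then broadcasts it back.
        result[idx] = sum(np.asarray(v)[idx] for v in node_vectors)
    return result
```

A single-tree in-network reduction is the `num_trees = 1` case, which is exactly the single-link-bandwidth ceiling the paper sets out to break.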